Philip G. Gage
5345 El Camino Drive
Colorado Springs, CO 80918
Home: 719-593-1801
Work: 719-570-8089
CIS: 70541,3645
INTRODUCTION
The Sixpack program is a file compression utility using a string copy
algorithm and adaptive Huffman encoding. The program was written
specifically for the Data Compression contest announced in the February 1991
issue of Dr. Dobb's Journal, based on earlier compression programs that I
have developed over the past few years. The goal was to achieve maximum
compression on a 640K PC, even if it ran slowly.
The main disadvantage is slow compression time, since the algorithm
repeatedly searches the last few thousand bytes looking for the longest
string that matches the current text. Decompression is faster, involving no
text searching. The compression speed can be adjusted somewhat by changing
the search parameters.
The whimsical name Sixpack was chosen because the program combines six
algorithms into a single data packing method. The algorithms illustrate a
variety of data structures, including a binary tree, a hash table, doubly
linked lists and a circular array. I must admit that integrating all these
concepts into a working program was quite educational. A brief description
of each algorithm follows.
FINITE WINDOW COMPRESSION
The basic idea is to maintain a "finite window" buffer of the most recent
few thousand characters and search this buffer for the longest string
matching the current text. If such a string is found and it meets or
exceeds a minimum length, then compression can be achieved by encoding the
matching section of the current text as the number of characters to copy and
the distance from which to copy them. If no matching string of the minimum
length or longer is found, the current character is output as a literal
without compression and the algorithm proceeds to the next input character.
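As a rough sketch of the idea (not the actual Sixpack code), the small program
below does a brute force search of the previous WINDOW characters and emits
either a literal or a copy token.  The window size and minimum copy length are
assumed values, and for simplicity the distance here is measured back from the
first character of the match rather than the last.

    /* Minimal finite window sketch (illustration only, not the Sixpack code).
     * Brute force search of the previous WINDOW bytes for the longest match;
     * prints "literal" and "copy" tokens as readable text.
     */
    #include <stdio.h>
    #include <string.h>

    #define WINDOW  4096        /* assumed window size */
    #define MINCOPY 3           /* assumed minimum match length */

    int main(void)
    {
        const char *text = "the quick brown fox and the quick brown dog";
        int n = (int)strlen(text);
        int pos = 0;

        while (pos < n) {
            int best_len = 0, best_dist = 0;
            int start = pos > WINDOW ? pos - WINDOW : 0;

            /* find the longest earlier string matching the current text */
            for (int i = start; i < pos; i++) {
                int len = 0;
                while (pos + len < n && text[i + len] == text[pos + len])
                    len++;
                if (len > best_len) {
                    best_len = len;
                    best_dist = pos - i;
                }
            }

            if (best_len >= MINCOPY) {
                printf("copy    length %d, distance %d\n", best_len, best_dist);
                pos += best_len;            /* move past the copied text */
            } else {
                printf("literal '%c'\n", text[pos]);
                pos++;                      /* emit one character unchanged */
            }
        }
        return 0;
    }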
This finite window scheme generates two types of codes: single literal
characters, and string copies consisting of length and distance values. To
avoid useless copy length/distance pairs, the distance is measured from the
last character of the string to be copied instead of the first character.
Several distance formats with different numbers of bits are used to
minimize the distance code size. Another enhancement is not to issue a copy
if a better copy exists at the next character. A final improvement is to
check for an alphabetized dictionary word file and restrict copies to
roughly a one-word distance on such files for greater compression.
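As a purely illustrative example of multiple distance formats, the field
widths and selector codes below are made up and are not the actual Sixpack
values; the point is only that nearby copies, which are the most common,
cost fewer bits than distant ones.

    /* Illustrative distance format selection (the real Sixpack field widths
     * and selector codes differ; these numbers are invented for the example).
     */
    #include <stdio.h>

    /* report how a copy distance might be encoded: a 2-bit selector
     * followed by a 6, 9 or 13 bit distance field */
    void describe_distance(int distance)
    {
        if (distance < 64)
            printf("distance %5d: selector 0, 6-bit field, 8 bits total\n", distance);
        else if (distance < 512)
            printf("distance %5d: selector 1, 9-bit field, 11 bits total\n", distance);
        else
            printf("distance %5d: selector 2, 13-bit field, 15 bits total\n", distance);
    }

    int main(void)
    {
        describe_distance(10);
        describe_distance(200);
        describe_distance(4000);
        return 0;
    }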
This algorithm is more similar to the original Lempel-Ziv approach than to
the later LZW implementation, and resembles methods described in "Data
Compression with Finite Windows", Communications of the ACM, April 1989.
The original Lempel-Ziv idea combines each copy with a literal character,
while the ACM article uses blocks of literal characters. The well-known
LHARC/ICE program uses a similar method to achieve impressive results.
CIRCULAR BUFFER ARRAY
The first problem is how to store the buffer of recent text. Maintaining a
queue with a linked list would complicate searching, and shifting the entire
contents of an array to add each new character would be too slow.
The buffering technique used in Sixpack is to store the text in a circular
array which wraps around on itself. When the end of the array is reached,
the position is reset to the beginning of the array and old text is
overwritten. No additional data structures are needed and the array occupies
minimum space.
Since character positions are fixed during their lifetime in the array, the
linked lists described later can be allocated in parallel with the buffer
array, using the character positions as the corresponding linked list node
numbers. The disadvantage of this method is that all operations involving
text strings in the buffer must account for the wrap at the end of the
buffer.
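A minimal sketch of such a circular buffer follows, with an assumed and
deliberately tiny size so that the wrap is easy to see; the names are
illustrative rather than the actual Sixpack routines.

    /* Sketch of a circular text buffer (illustration only).
     * New characters overwrite the oldest text when the array wraps around.
     */
    #include <stdio.h>

    #define BUFSIZE 8             /* tiny size so the wrap is visible */

    static char buffer[BUFSIZE];
    static int  pos = 0;          /* next position to write */

    /* add one character, wrapping to the start when the end is reached */
    void add_char(int c)
    {
        buffer[pos] = (char)c;
        pos++;
        if (pos == BUFSIZE)
            pos = 0;
    }

    /* read the character 'dist' positions behind the current one,
     * accounting for the wrap at the end of the buffer */
    int char_at(int dist)
    {
        int i = pos - dist;
        if (i < 0)
            i += BUFSIZE;
        return buffer[i];
    }

    int main(void)
    {
        const char *s = "abcdefghij";      /* 10 chars, 2 more than BUFSIZE */
        for (int i = 0; s[i]; i++)
            add_char(s[i]);
        printf("%c%c%c\n", char_at(3), char_at(2), char_at(1));  /* prints "hij" */
        return 0;
    }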
HASH TABLE
The fundamental problem is finding the longest string match in a large block
of text. A brute force search gives very slow performance. Several search
algorithms were tested, including a binary search tree, a direct lookup
table and fast text search techniques. For this application, the best
method seemed to be a hash table where the key is derived from the first few
characters at each location in the buffer.
Each entry in the hash table is a doubly linked list containing the indices
of all buffer positions with matching text. Each list requires both a head
and tail pointer. Since several string prefixes may generate the same hash
key, some collisions may occur and the entire string must be checked during
a search.
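The sketch below shows one possible layout; the table sizes and the mixing
function are assumptions, not the actual Sixpack values.  A key is derived
from the first few characters at a buffer position, and each slot holds the
head and tail of a list of buffer positions.

    /* Sketch of a hash table keyed on string prefixes (illustration only).
     * Each slot holds the head and tail of a doubly linked list of buffer
     * positions whose first few characters hash to that slot.
     */
    #include <stdio.h>

    #define BUFSIZE  4096                 /* assumed circular buffer size */
    #define HASHSIZE 1024                 /* assumed number of hash slots */
    #define NIL      (-1)                 /* empty list marker */

    static char buffer[BUFSIZE];          /* circular text buffer */
    static int  head[HASHSIZE];           /* newest node for each key */
    static int  tail[HASHSIZE];           /* oldest node for each key */

    /* derive a key from the first three characters at position 'pos',
     * wrapping indices so the key works anywhere in the circular buffer */
    int hash_key(int pos)
    {
        unsigned h = 0;
        for (int i = 0; i < 3; i++)
            h = h * 31 + (unsigned char)buffer[(pos + i) % BUFSIZE];
        return (int)(h % HASHSIZE);
    }

    int main(void)
    {
        for (int i = 0; i < HASHSIZE; i++)
            head[i] = tail[i] = NIL;      /* all lists start out empty */

        buffer[0] = 't'; buffer[1] = 'h'; buffer[2] = 'e';
        printf("key for \"the\" at position 0: %d\n", hash_key(0));
        return 0;
    }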
DOUBLY LINKED LISTS
Linked lists are efficient for storing string prefixes in the hash table,
since the prefixes are continually being deleted when they reach the maximum
search distance and many duplicate keys may exist in some lists. Hash table
chaining techniques would be awkward in this situation.
Both successor and predecessor pointers must be kept for a doubly linked
list. New nodes are added at the head of the list and old nodes are deleted
at the tail of the list. A singly linked list would result in slow delete
times, since the entire list must be scanned to find the last node.
Searching begins at the head of the list, keeping track of the best matching
string seen so far. This method has the advantage that the most recent
string match is always found, resulting in shorter distance copies that can
be encoded in fewer bits. No actual information needs to be stored in the
lists, because the node pointers also indicate the character positions in
the buffer.
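A combined sketch of the list handling is given below; the sizes, names and
hash function are illustrative rather than the actual Sixpack routines, and
the wrap at the end of the buffer is ignored for brevity.  Nodes are numbered
by buffer position, new nodes are linked in at the head, old nodes are
unlinked from the tail, and a search walks from the head remembering the
longest match found so far.

    /* Sketch of doubly linked lists over buffer positions (illustration only).
     * succ[] and pred[] are allocated in parallel with the text buffer, so a
     * node number is simply a character position; the lists store no data of
     * their own.
     */
    #include <stdio.h>
    #include <string.h>

    #define BUFSIZE  64                   /* small sizes keep the example short */
    #define HASHSIZE 16
    #define NIL      (-1)                 /* empty pointer marker */

    static char buffer[BUFSIZE];
    static int  succ[BUFSIZE], pred[BUFSIZE];
    static int  head[HASHSIZE], tail[HASHSIZE];

    static int hash_key(int pos)          /* key from the first two characters */
    {
        return ((unsigned char)buffer[pos] * 31
                + (unsigned char)buffer[(pos + 1) % BUFSIZE]) % HASHSIZE;
    }

    static void add_node(int pos)         /* link a new node in at the head */
    {
        int key = hash_key(pos);
        succ[pos] = head[key];
        pred[pos] = NIL;
        if (head[key] != NIL)
            pred[head[key]] = pos;
        head[key] = pos;
        if (tail[key] == NIL)
            tail[key] = pos;
    }

    static void delete_oldest(int key)    /* unlink the oldest node at the tail */
    {
        int old = tail[key];
        if (old == NIL)
            return;
        tail[key] = pred[old];
        if (tail[key] != NIL)
            succ[tail[key]] = NIL;
        else
            head[key] = NIL;
    }

    /* walk the list from the head, remembering the longest match at 'pos';
     * the wrap at the end of the buffer is not handled in this sketch */
    static int find_match(int pos, int *match_pos)
    {
        int best = 0;
        for (int i = head[hash_key(pos)]; i != NIL; i = succ[i]) {
            int len = 0;
            if (i == pos)
                continue;
            while (pos + len < BUFSIZE - 1 && i + len < BUFSIZE
                   && buffer[i + len] == buffer[pos + len])
                len++;
            if (len > best) {
                best = len;
                *match_pos = i;
            }
        }
        return best;
    }

    int main(void)
    {
        int where = 0;

        strcpy(buffer, "the cat saw the dog");
        for (int i = 0; i < HASHSIZE; i++)
            head[i] = tail[i] = NIL;
        for (int i = 0; i < 11; i++)      /* index the text before position 12 */
            add_node(i);

        printf("longest match for position 12: length %d at position %d\n",
               find_match(12, &where), where);
        delete_oldest(hash_key(0));       /* retire the oldest entry for "th" */
        return 0;
    }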
ADAPTIVE HUFFMAN CODING
As a final compression stage, each literal character and copy length code is
passed through an adaptive frequency filter which squeezes frequently
occurring characters into short bit strings. The possible copy length codes
for each distance range are added to the end of the normal character set.
The copy distance values are likely to be more random and not susceptible to
frequency encoding, so they are output using fixed length bit strings.
A binary prefix code tree which approximates the famous Huffman tree is
maintained by counting each character and propagating the count upward
through the tree. During this propagation the frequency of each node is
calculated as the sum of the frequencies of both children. The new
frequency of each traversed node is then compared to that of the node that
is up two levels and down one. If the node with the higher frequency sits
lower in the tree, the two nodes are swapped. To avoid overflow and provide
local adaptation to changing data, the frequencies are periodically scaled
down by a
factor of two.
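The sketch below shows the flavor of this update step on a deliberately tiny
tree; the node layout, starting counts and names are assumptions and differ
from the real Sixpack tables.  A leaf count is bumped, parent counts are
recomputed as the sum of their children on the way up, and a node is swapped
with its parent's sibling (the node up two levels and down one) whenever the
lower node has the higher count, so common symbols drift toward the root and
get shorter codes.

    /* Sketch of an adaptive frequency-ordered prefix code tree (illustration
     * only; the real Sixpack update differs in detail).
     */
    #include <stdio.h>

    #define SYMBOLS 4                       /* tiny alphabet: symbols 0..3 */
    #define NODES   (2 * SYMBOLS - 1)       /* 3 internal nodes + 4 leaves */
    #define ROOT    0

    static int up[NODES], lchild[NODES], rchild[NODES], freq[NODES];
    static int leaf[SYMBOLS];               /* node index of each symbol's leaf */

    /* build a fixed starting shape: node 0 is the root, leaves are 3..6 */
    static void init_tree(void)
    {
        lchild[0] = 1; rchild[0] = 2;
        lchild[1] = 3; rchild[1] = 4;
        lchild[2] = 5; rchild[2] = 6;
        up[1] = up[2] = 0; up[3] = up[4] = 1; up[5] = up[6] = 2;
        for (int s = 0; s < SYMBOLS; s++) { leaf[s] = 3 + s; freq[3 + s] = 1; }
        freq[1] = freq[2] = 2; freq[0] = 4;
    }

    static void swap_nodes(int a, int b)     /* exchange two whole subtrees */
    {
        int pa = up[a], pb = up[b];
        if (lchild[pa] == a) lchild[pa] = b; else rchild[pa] = b;
        if (lchild[pb] == b) lchild[pb] = a; else rchild[pb] = a;
        up[a] = pb; up[b] = pa;
    }

    static void update(int symbol)
    {
        int a = leaf[symbol];
        freq[a]++;                            /* count the new occurrence */
        while (a != ROOT) {
            int parent = up[a];
            if (parent != ROOT) {
                int grand = up[parent];
                int uncle = (lchild[grand] == parent) ? rchild[grand] : lchild[grand];
                if (freq[a] > freq[uncle]) {  /* higher count sits lower: swap */
                    swap_nodes(a, uncle);
                    /* the demoted parent changed a child; refresh its count */
                    freq[parent] = freq[lchild[parent]] + freq[rchild[parent]];
                }
            }
            a = up[a];                        /* step up (past the swap, if any) */
            freq[a] = freq[lchild[a]] + freq[rchild[a]];   /* sum of children */
        }
    }

    static int depth(int node)                /* code length = depth of the leaf */
    {
        int d = 0;
        while (node != ROOT) { node = up[node]; d++; }
        return d;
    }

    int main(void)
    {
        init_tree();
        for (int i = 0; i < 20; i++) update(0);   /* symbol 0 becomes common */
        update(1); update(2); update(3);
        for (int s = 0; s < SYMBOLS; s++)
            printf("symbol %d: freq %d, code length %d\n",
                   s, freq[leaf[s]], depth(leaf[s]));
        return 0;
    }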
The data structures and compress/uncompress routines are derived from Pascal
versions presented in "Application of Splay Trees to Data Compression",
Communications of the ACM, August 1988. I have replaced the original
semi-splaying with frequency coding, which gives better results for this
application but runs slower.
BIT PACKING
The final topic to be covered is packing and unpacking of variable length
bit strings. Several different sizes of codes are used for copy distance
values, while the adaptive Huffman algorithm processes individual bits.
Routines to handle single bits and multibit codes are used in the program.
A flush routine writes any buffered bits to the output file before it is
closed.
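The sketch below shows the general shape of such routines, writing the most
significant bit first through a one byte accumulator; the actual Sixpack
routines, their names and their bit ordering differ.

    /* Sketch of variable length bit packing (illustration only).
     * Bits collect in an accumulator and are written a byte at a time;
     * flush_bits() pads and writes any leftover bits before the file closes.
     */
    #include <stdio.h>

    static unsigned bit_buffer = 0;   /* bits waiting to be written */
    static int      bit_count  = 0;   /* how many bits are waiting */

    /* append the low 'nbits' bits of 'code' to the output stream */
    void put_code(FILE *out, unsigned code, int nbits)
    {
        for (int i = 0; i < nbits; i++) {
            bit_buffer = (bit_buffer << 1) | ((code >> (nbits - 1 - i)) & 1);
            if (++bit_count == 8) {          /* a full byte is ready */
                putc((int)(bit_buffer & 0xFF), out);
                bit_buffer = 0;
                bit_count = 0;
            }
        }
    }

    /* write any buffered bits, padding the final byte with zeros */
    void flush_bits(FILE *out)
    {
        if (bit_count > 0) {
            putc((int)((bit_buffer << (8 - bit_count)) & 0xFF), out);
            bit_buffer = 0;
            bit_count = 0;
        }
    }

    int main(void)
    {
        FILE *out = fopen("packed.bin", "wb");
        if (out == NULL)
            return 1;
        put_code(out, 1, 1);      /* a single flag bit   */
        put_code(out, 5, 4);      /* a 4-bit length code  */
        put_code(out, 300, 12);   /* a 12-bit distance    */
        flush_bits(out);          /* 17 bits -> 3 bytes written */
        fclose(out);
        return 0;
    }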
SUMMARY
In summary, the Sixpack program provides very good compression with low
memory usage, about 200K for compression and 50K for decompression. The
code is fairly simple and generates an executable file only 14K in size. It
uses a one-pass method suitable for large files and redirected data streams.
The main disadvantage is slow compression speed, making it more suitable for
archive distribution than for routine backups. There is much room for
performance improvement, making this a potentially useful method.